Planning for SQL Server Data Replication
You must consider many factors
when choosing a method to distribute data; your business requirements
determine which method is right for you. In general, you need to
understand the timing and latency of your data, its independence at each
site, and your specific need to filter or partition the data.
Autonomy, Timing, and Latency of Data
Distributed data implementations
can be accomplished using a few different facilities in Microsoft SQL Server:
Integration Services (IS), the Distributed Transaction Coordinator (DTC),
and data replication. The trick is to match the right facility to the
type of data distribution you need to accomplish.
In some applications, such as
online transaction processing and inventory control systems, data must
be synchronized at all times. This requirement, called immediate transactional consistency, was known as tight consistency in previous versions of SQL Server.
SQL Server implements immediate transactional consistency data distribution in the form of two-phase commit processing. A two-phase commit, sometimes known as 2PC,
ensures that a transaction is either committed on all servers or
rolled back on all servers. This ensures that all data on
all servers is 100% in sync at all times. One of the main drawbacks of
immediate transactional consistency is that it requires a high-speed LAN
to work. This type of solution might not be feasible for large
environments with many servers because occasional network outages can
occur. These types of implementations can be built with DTC and IS.
In other applications, such
as decision support and report generation systems, 100% data
synchronization all the time is not terribly important. This
requirement, called latent transactional consistency, was known as loose consistency in previous versions of SQL Server.
Latent transactional
consistency is implemented in SQL Server via data replication.
Replication allows data to be updated on all servers, but the process is
not a simultaneous one. The result is “real-enough-time” data. This is
known as latent transactional consistency because a lag exists between
the data updated on the main server and the replicated data.
In this scenario, if you could stop all data modifications from
occurring on all servers, all the servers would eventually have the same
data. Unlike the two-phase commit model, replication works over
both LANs and WANs, and over slow or fast links.
When planning a distributed application, you must consider the effect of one site’s operation on another. This is known as site autonomy.
A site with complete autonomy can continue to function without being
connected to any other site. A site with no autonomy cannot function
without being connected to all other sites. For example, applications
that utilize two-phase commits rely on all other sites being able to
immediately accept changes sent to them. In the event that any one site
is unavailable, no transactions on any server can be committed. In
contrast, sites using merge replication can be completely disconnected
from all other sites and continue to work effectively, although data
consistency is not guaranteed until the sites reconnect and synchronize.
Luckily, some solutions combine both high data consistency and site autonomy.
Methods of Data Distribution
After you have
determined the amount of transactional latency and site autonomy needed,
based on your business requirements, you need to select the
corresponding data distribution method. Each type of data
distribution offers a different degree of site autonomy and latency. With
these distributed data systems, you can choose from several methods:
Distributed transactions—
Distributed transactions ensure that all sites have the same data at
all times. You pay a certain amount of overhead cost to maintain this
consistency. (We do not discuss this non-data-replication method here.)
Transactional replication with updating subscribers—
Users can change data at the local location, and those changes are
applied to the source database at the same time. The changes are then
eventually replicated to other sites. This type of data distribution
combines replication and distributed transactions because data is
changed at both the local site and source database.
Peer-to-peer replication—
A variation on the transactional replication with updating subscribers
theme, peer-to-peer replication is essentially full
transactional replication between two (or more) sites, but
publisher-to-publisher rather than publisher-to-updating-subscriber.
There is no publisher (parent) and subscriber (child) hierarchy.
Transactional replication—
With transactional replication, data is changed only at the source
location and is sent out to the subscribers. Because data is changed at
only a single location, conflicts cannot occur.
Snapshot replication with updating subscribers—
This method is much like transactional replication with updating
subscribers; users can change data at the local location, and those
changes are applied to the source database at the same time. The entire
changed publication is then replicated to all subscribers. This type of
replication provides higher autonomy than transactional replication.
Snapshot replication— A complete copy of the publication is sent out to all subscribers. This includes both changed and unchanged data.
Merge replication—
All sites make changes to local data independently and then update the
publisher. It is possible for conflicts to occur, but they can be
resolved.
SQL Server Replication Types
Microsoft has narrowed
the field to three major types of data replication approaches within SQL
Server: snapshot, transactional, and merge. Each publication uses
only a single replication type; however, it is possible to have
multiple replication types per database (one per publication).
Snapshot Replication
Snapshot replication makes an
image of all the tables in a publication at a single moment in time and
then moves that entire image to the subscribers. Little overhead on the
server is incurred because snapshot replication does not track data
modifications as the other forms of replication do. It is possible,
however, for snapshot replication to require large amounts of network
bandwidth, especially if the articles being replicated are large.
Snapshot replication is the easiest form of replication to set up and is
used primarily with smaller tables for which subscribers do not have to
perform updates. An example of this might be a phone list that is to be
replicated to many subscribers. This phone list is not considered
critical data, and a periodic refresh is more than
enough to satisfy all its users.
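As a sketch only, the following T-SQL shows how such a phone list publication might be defined with the standard replication stored procedures. The database, publication, and table names are illustrative, and a distributor is assumed to be already configured.

USE HRDB;
GO
-- Enable the database for publishing
EXEC sp_replicationdboption
     @dbname = N'HRDB',
     @optname = N'publish',
     @value = N'true';

-- Create a snapshot publication
EXEC sp_addpublication
     @publication = N'PhoneListPub',
     @repl_freq = N'snapshot',
     @status = N'active';

-- Create the snapshot agent job for the publication
EXEC sp_addpublication_snapshot
     @publication = N'PhoneListPub';

-- Add the phone list table as an article
EXEC sp_addarticle
     @publication = N'PhoneListPub',
     @article = N'PhoneList',
     @source_object = N'PhoneList';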
The primary agents used for snapshot replication are the snapshot agent and distribution agent.
The snapshot agent
creates files that contain the schema of the publication and the data.
The files are temporarily stored in the snapshot folder of the
distribution server, and then the distribution jobs are recorded in the
distribution database.
The distribution agent is responsible for moving the schema and data from the distributor to the subscribers.
A few other agents are also
used; they handle other tasks needed for replication, such as cleaning up
files and history. In snapshot replication, after the snapshot has
been delivered to all the subscribers, these agents delete the
associated .bcp and .sch files from the distributor’s working directory.
Transactional Replication
Transactional
replication is the process of capturing transactions from the
transaction log of the published database and applying them to the
subscription databases. With SQL Server transactional replication, you
can publish all or part of a table, a view, or one or more stored
procedures as an article. All data updates are then stored in the
distribution database and sent and applied to any number of subscribing
servers. Obtaining these updates from the publishing database’s
transaction log is extremely efficient. No direct reading of tables is
required except during the initial snapshot, and only a minimal amount of traffic is generated over the network. This has made transactional replication the most often used method.
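For illustration, the following hedged sketch creates a transactional publication using the same stored procedures; the names are invented, and the database is assumed to already be enabled for publishing.

-- Create a transactional publication (changes flow continuously)
EXEC sp_addpublication
     @publication = N'OrdersPub',
     @repl_freq = N'continuous',
     @status = N'active';

-- Publish the Orders table as a log-based article
EXEC sp_addarticle
     @publication = N'OrdersPub',
     @article = N'Orders',
     @source_object = N'Orders',
     @type = N'logbased';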
As data changes are made, they
are propagated to the other sites in near real-time; you determine
the frequency of this propagation. Because changes are usually made only
at the publishing server, data conflicts are avoided for the most part.
As an example, push subscribers usually receive updates from the
publisher in a minute or less, depending on the speed and availability
of the network. Subscribers also can be set up for pull subscriptions.
This capability is useful for disconnected users who are not connected
to the network at all times.
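One quick way to gauge that latency on the publisher is the sp_replcounters system procedure, which reports replicated transaction counts, throughput, and latency in seconds for each published database.

-- Run on the publisher; returns one row per published database
EXEC sp_replcounters;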
The primary agents used for transactional replication are the snapshot agent, log reader agent, and distribution agent:
The snapshot agent
creates files that contain the schema of the publication and the data.
The files are stored in the snapshot folder of the distribution server,
and the distribution jobs are recorded in the distribution database.
The
log reader agent monitors the transaction log of the database that it
is set up to service. Each published database has its own log reader
agent, which copies the transactions from the
transaction log of that published database into the distribution
database.
The
distribution agent is responsible for moving the schema and data from
the distributor to the subscribers for the initial synchronization and
then moving all the subsequent transactions from the published database
to each subscriber as they come in. These transactions are stored in the
distribution database for a certain length of time and are eventually
purged.
A few other agents deal with
the other housekeeping issues surrounding data replication, such as
schema files cleanup, history cleanup, and transaction cleanup.
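For reference, the transaction cleanup task boils down to a call along the following lines against the distribution database (the exact job definition may vary by version); the retention values are in hours.

-- Purge delivered transactions older than the maximum retention period
EXEC distribution.dbo.sp_MSdistribution_cleanup
     @min_distretention = 0,
     @max_distretention = 72;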
Merge Replication
Merge replication
involves getting the publisher and all subscribers initialized and then
allowing data to be changed at all sites involved in the merge
replication, at the publisher and at any subscriber. All these changes
to the data are subsequently merged at certain intervals so that, again,
all copies of the database have identical data.
Occasionally, data conflicts
have to be resolved. The publisher does not always win in a conflict
resolution. Instead, the winner is determined by whatever criteria you
establish.
The primary agents used for merge replication are the snapshot agent and merge agent:
The snapshot
agent creates files that contain the schema of the publication and the
data. The files are stored in the snapshot folder of the distribution
server, and the distribution jobs are recorded in the distribution
database. This is essentially the same behavior as with the other
replication types.
The
merge agent takes the initial snapshot and applies it to all the
subscribers. It then reconciles all changes made on all the servers,
based on the rules you configure.
Preparing for Merge Replication
When you set up a
table for merge replication, SQL Server makes three schema changes to
the database. First, it must either identify or create a column that
uniquely identifies each row to be replicated. This column is used to
match rows across all the different copies of the table. If the
table already contains a column with the ROWGUIDCOL property, SQL Server automatically uses that column for the row identifier. If not, SQL Server adds a column called rowguid to the table. SQL Server also places an index on this rowguid column.
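A table created as follows, for example, already satisfies the row identifier requirement, so SQL Server would use its rowguid column rather than adding one. The table definition is illustrative.

CREATE TABLE dbo.PhoneList
(
    PhoneID  int              NOT NULL PRIMARY KEY,
    FullName nvarchar(100)    NOT NULL,
    Phone    varchar(20)      NOT NULL,
    -- ROWGUIDCOL marks this column as the merge row identifier
    rowguid  uniqueidentifier ROWGUIDCOL NOT NULL
             CONSTRAINT DF_PhoneList_rowguid DEFAULT NEWSEQUENTIALID()
);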
Next, SQL Server adds
triggers to the table to track changes that occur to the data in the
table and record them in the merge system tables. The triggers can track
changes at either the row or column level, depending on how you set it
up. SQL Server supports multiple triggers of the same type on a table,
so merge triggers do not interfere with user-defined triggers on the
table.
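If you want to see the system-generated merge triggers alongside your own, a query such as the following (illustrative, not required for replication) lists them; their names begin with MSmerge.

-- List merge replication triggers and the tables they belong to
SELECT t.name AS trigger_name,
       OBJECT_NAME(t.parent_id) AS table_name
FROM sys.triggers AS t
WHERE t.name LIKE 'MSmerge%';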
Finally, SQL Server adds new system tables to the database that contains the replicated tables. The MSmerge_contents and MSmerge_tombstone tables track the updates, inserts, and deletes. These tables rely on rowguid to track which rows have actually been changed.
The merge agent is
responsible for moving changed data from the site where it was changed
to all other sites in the replication scenario. When a row is updated,
the triggers added by SQL Server fire off and update the new system
tables, setting the generation column equal to 0 for the corresponding rowguid. When the merge agent runs, it collects the data from the rows where the generation column is 0
and then resets the generation values to values higher than the
previous generation numbers. This allows the merge agent to skip
data that has already been shared with other sites without having to
look through all the data. The merge agent then sends the changed data
to the other sites.
When the data reaches the
other sites, the data is merged with existing data according to rules
you have defined. These rules are flexible and highly extensible. The
merge agent evaluates existing and new data and resolves conflicts based
on priorities or which data was changed first. You can also create
custom resolution strategies using Component
Object Model (COM) components or custom stored procedures. After conflicts have
been handled, synchronization occurs to ensure that all sites have the
same data.
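As a hedged example, the following call assigns one of the Microsoft-supplied resolvers when adding a merge article; the publication, article, and LastModified column names are illustrative.

EXEC sp_addmergearticle
     @publication = N'PhoneListMergePub',
     @article = N'PhoneList',
     @source_object = N'PhoneList',
     -- Later change wins, decided by a datetime column
     @article_resolver = N'Microsoft SQL Server DATETIME (Later Wins) Conflict Resolver',
     @resolver_info = N'LastModified';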
The merge agent identifies conflicts using the MSmerge_contents table. In this table, a column called lineage is used to track the history of changes to a row. The agent updates the lineage
value whenever a user makes changes to the data in a row. The entry
into this column is a combination of a site identifier and the last
version of the row created at the site. As the merge agent is merging
all the changes that have occurred, it examines each site’s information
to see whether a conflict has occurred. If a conflict has occurred, the
agent initiates conflict resolution based on the criteria mentioned
earlier.
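If you are curious, you can inspect this metadata directly in a merge-published database; the query below (read-only, for illustration) shows recent entries, including the lineage values the agent compares.

-- Peek at merge change-tracking metadata (do not modify these tables)
SELECT TOP (10) rowguid, generation, lineage
FROM dbo.MSmerge_contents
ORDER BY generation DESC;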